1. Introduction In sequential decision problems a decision maker (or forecaster) tries to predict the 
outcome of a certain unknown process at each (discrete) time instance and takes an action accordingly. 
Depending on the outcome of the predicted event and the action taken, the decision maker receives a 
reward. Very often, probabilistic modeling of the underlying process is difficult. For such situations the 
prediction problem can be formalized as a repeated game between the decision maker and the environment. 
This formulation goes back to the 1950s, when Hannan [TB] and Blackwell [Sj showed that the 
decision maker has a randomized strategy that guarantees, regardless of the outcome sequence, an average 
asymptotic reward as high as the maximal reward one could get by knowing the empirical distribution 
of the outcome sequence in advance. Such strategies are called Hannan consistent. To prove this result, 
Hannan and Blackwell assumed that the decision maker has full access to the past outcomes. This case is 
termed the full information or the perfect monitoring case. However, in many important applications, the 
decision maker has limited information about the past elements of the sequence to be predicted. Various 
models of limited feedback have been considered in the literature. Perhaps the best known of them is 
the so-called multi-armed bandit problem in which the forecaster is only informed of its own reward but 
not the actual outcome; see Baños [3], Megiddo [23], Foster and Vohra [2], Auer, Cesa-Bianchi, Freund, 
and Schapire pQ, Hart and Mas-Colell [T7J [TS]. For example, it is shown in [T] that Hannan consistency 
is achievable in this case as well. 

Sequential decision problems like the ones considered in this paper have been studied in different 
fields under various names such as repeated games, regret minimization, on-line learning, prediction 
of individual sequences, and sequential prediction. The vocabulary of the different sub-communities differs. 
Ours is perhaps closest to that used by learning theorists. For a general introduction and survey of the 
sequential prediction problem we refer to Cesa-Bianchi and Lugosi [10]. 

In this paper we consider a general model in which the information available to the forecaster is a 
general given (possibly randomized) function of the outcome and the decision of the forecaster. It is 
well understood under what conditions Hannan consistency is achievable in this setup, see Piccolboni 
and Schindelhauer [25] and Cesa-Bianchi, Lugosi, and Stoltz [11]. Roughly speaking, this is possible 
whenever, after suitable transformations of the problem, the reward matrix can be expressed as a linear 
function of the matrix of (expected) feedback signals. However, this condition is not always satisfied 
and then the natural question is what the best achievable performance for the decision maker is. This 
question was answered by Rustichini [26], who characterized the maximal achievable average reward that 
can be guaranteed asymptotically for all possible outcome sequences (in an almost sure sense). 

However, Rustichini's proof of achievability is not constructive. It uses abstract approachability theorems 
due to Mertens, Sorin, and Zamir [24] and it seems unlikely that his proof method can give rise to 
computationally efficient prediction algorithms, as noted in the conclusion of [26]. A simplified efficient 
approachability-based strategy in the special case where the feedback is a function of the action of nature 
alone was shown in Mannor and Shimkin [22]. In the general case, the simplified approachability-based 
strategy of [25] falls short of the maximal achievable average reward characterized by Rustichini [26]. 
The goal of this paper is to develop computationally efficient forecasters in the general prediction problem 
under imperfect monitoring that achieve the best possible asymptotic performance. 

We introduce several forecasting strategies that exploit some specific properties of the problem at 
hand. We separate four cases, according to whether the feedback signal only depends on the outcome or 
both on the outcome and the forecaster's action and whether the feedback signal is deterministic or not. 
We design different prediction algorithms for all four cases. 

As a by-product, we also obtain finite-horizon performance bounds with explicit guaranteed rates of 
convergence in terms of the number n of rounds the prediction game is played. In the case of deterministic 
feedback signals these rates are optimal up to logarithmic factors. In the random feedback signal case we 
do not know if it is possible to construct forecasters with a significantly smaller regret. 

A motivating example for such a prediction problem arises naturally in multi-access channels that are 
prevalent in both wired and wireless networks. In such networks, the communication medium is shared 
between multiple decision makers. It is often technically difficult to synchronize between the decision 
makers. Channel sharing protocols, and, in particular, several variants of spread spectrum, allow multiple 
agents to use the same channel (or channels that may interfere with each other) simultaneously. More 
specifically, consider a wireless system where multiple agents can choose in which channel to transmit 
data at any given time. The quality of each channel may be different and interference from other users 
using this channel (or other "close" channels) may affect the base-station reception. The transmitting 
agent may choose which channel to use and how much power to spend on every transmission. The agent 
has a tradeoff between the amount of power wasted on transmission and the cost of having its message 
only partially received. The transmitting agent may not receive immediate feedback on how much data 
were received in the base station (even if feedback is received, it often happens on a much higher layer 
of the communication protocol). Instead, the transmitting agent can monitor the transmissions of the 
other agents. However, since the transmitting agent is physically far from the base-station and the other 
agents, the information about the channels chosen by other agents and the amount of power they used is 
imperfect. This naturally abstracts to an online learning problem with imperfect monitoring. 

The paper is structured as follows. In the next section we formalize the prediction problem we investigate, 
introduce the target quantity, that is, the best achievable reward, and the notion of regret. In 
Section 3 we describe some analytical properties of a key function $\rho$, defined in Section 2. This function 
represents the worst possible average reward for a given vector of observations and is needed in our 
analysis. In Section 4 we consider the simplest special case when the actions of the forecaster do not 
influence the feedback signal, which is, moreover, deterministic. This case is basically as easy as the full 
information case and we obtain a regret bound of the order of $n^{-1/2}$ (with high probability) where $n$ is 
the number of rounds of the prediction game. In Section 5 we study random feedback signals but still 
with the restriction that they are determined only by the outcome. Here we are able to obtain a regret of 
the order of $n^{-1/4}\sqrt{\log n}$. The most general case is dealt with in Section 6. The forecaster introduced 
there has a regret of the order of $n^{-1/5}\sqrt{\log n}$. Finally, in Section 7 we show that this may be improved 
to $O(n^{-1/3})$ in the case of deterministic feedback signals, which is known to be optimal (see 

2. Problem setup, notation The randomized prediction problem is described as follows. Consider 
a sequential decision problem in which a forecaster has to predict an outcome that may be thought of as 
an action taken by the environment. 

At each round, $t = 1, 2, \dots, n$, the forecaster chooses an action $i \in \{1, \dots, N\}$ and the environment 
chooses an action $j \in \{1, \dots, M\}$ (which we also call an "outcome"). The forecaster's reward $r(i,j)$ is the 
value of a reward function $r : \{1, \dots, N\} \times \{1, \dots, M\} \to [0, 1]$. Now suppose that, at the $t$-th round, the 
forecaster chooses a probability distribution $p_t = (p_{1,t}, \dots, p_{N,t})$ over the set of actions, and plays action 
$i$ with probability $p_{i,t}$. We denote the forecaster's (random) action at time $t$ by $I_t$. If the environment 
chooses action $J_t \in \{1, \dots, M\}$, then the reward of the forecaster is $r(I_t, J_t)$. The prediction problem is 
defined as follows: 



Randomized prediction with perfect monitoring 

Parameters: number N of actions, cardinality M of outcome space, reward function r, number n 
of game rounds. 

For each round $t = 1, 2, \dots, n$, 

(1) the environment chooses the next outcome $J_t$; 

(2) the forecaster chooses $p_t$ and determines the random action $I_t$, distributed according to $p_t$; 

(3) the environment reveals $J_t$; 

(4) the forecaster receives a reward $r(I_t, J_t)$. 



Note in particular that the environment may react to the forecaster's strategy by using a possibly 
randomized strategy. Below, the probabilities of the considered events are taken with respect to the 
forecaster's and the environment's randomized strategies. The goal of the forecaster is to minimize the 
average regret and to enforce that 



\[
\limsup_{n \to \infty} \left( \max_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} r(i, J_t) - \frac{1}{n} \sum_{t=1}^{n} r(I_t, J_t) \right) \le 0 ,
\]

that is, the per-round realized differences between the cumulative reward of the best fixed strategy 
$i \in \{1, \dots, N\}$, in hindsight, and the reward of the forecaster, are asymptotically nonpositive. Denoting 
by $r(p, j) = \sum_{i=1}^{N} p_i \, r(i, j)$ the linear extension of the reward function $r$, the Hoeffding-Azuma inequality 
for sums of bounded martingale differences (see [19], [3]) implies that for any $\delta \in (0, 1)$, with probability 
at least $1 - \delta$, 

\[
\frac{1}{n} \sum_{t=1}^{n} r(I_t, J_t) \ge \frac{1}{n} \sum_{t=1}^{n} r(p_t, J_t) - \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} ,
\]

so it suffices to study the average expected reward $(1/n) \sum_{t=1}^{n} r(p_t, J_t)$. Hannan [TJ] and Blackwell [5] 
were the first to show the existence of a forecaster whose regret is $o(1)$ for all possible behaviors of 
the opponent. Here we mention a simple yet powerful forecasting strategy known as the exponentially 
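weighted average forecaster. This forecaster selects, at time $t$, an action $I_t$ according to the probabilities 

\[
p_{i,t} = \frac{\exp\left( \eta \sum_{s=1}^{t-1} r(i, J_s) \right)}{\sum_{k=1}^{N} \exp\left( \eta \sum_{s=1}^{t-1} r(k, J_s) \right)} , \qquad i = 1, \dots, N,
\]

where $\eta > 0$ is a parameter of the forecaster. One of the basic well-known results in the theory of 
prediction of individual sequences states that the regret of the exponentially weighted average forecaster 
is bounded as 

\[
\max_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} r(i, J_t) - \frac{1}{n} \sum_{t=1}^{n} r(p_t, J_t) \le \frac{\ln N}{n \eta} + \frac{\eta}{8} . \qquad (1)
\]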



With the choice $\eta = \sqrt{8 \ln N / n}$ the upper bound becomes $\sqrt{\ln N / (2n)}$. Different versions of this result 
have been proved by Littlestone and Warmuth [21], Vovk [23 EH], Cesa-Bianchi, Freund, Haussler, 
Helmbold, Schapire, and Warmuth [8], and Cesa-Bianchi [7]; see also Cesa-Bianchi and Lugosi [9]. 
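
The exponentially weighted average forecaster and its regret bound can be checked numerically. The following is a minimal sketch in Python, not code from the paper: the function names, the toy reward matrix, and the random outcome sequence are all illustrative assumptions. It computes the distributions $p_t$ under perfect monitoring and compares the per-round expected regret with $\sqrt{\ln N/(2n)}$.

```python
import math
import random

def ewa_probabilities(cum_rewards, eta):
    """Exponentially weighted average distribution from cumulative rewards."""
    shift = max(cum_rewards)  # subtracting the max does not change the ratios
    weights = [math.exp(eta * (c - shift)) for c in cum_rewards]
    total = sum(weights)
    return [w / total for w in weights]

def run_ewa(reward, outcomes, eta):
    """Return the per-round regret measured with expected rewards r(p_t, J_t).

    reward[i][j] plays the role of r(i, j) in [0, 1]; outcomes is J_1, ..., J_n.
    """
    n_actions = len(reward)
    cum = [0.0] * n_actions           # cumulative reward of each fixed action
    expected_reward = 0.0             # sum over t of r(p_t, J_t)
    for j in outcomes:
        p = ewa_probabilities(cum, eta)
        expected_reward += sum(p[i] * reward[i][j] for i in range(n_actions))
        for i in range(n_actions):    # perfect monitoring: J_t is revealed
            cum[i] += reward[i][j]
    return (max(cum) - expected_reward) / len(outcomes)

# Hypothetical toy game with N = 3 actions and M = 2 outcomes.
reward = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rng = random.Random(0)
outcomes = [rng.randrange(2) for _ in range(1000)]
n, N = len(outcomes), len(reward)
eta = math.sqrt(8 * math.log(N) / n)
regret = run_ewa(reward, outcomes, eta)
bound = math.sqrt(math.log(N) / (2 * n))
print(regret, bound)
```

Since the bound holds for every outcome sequence, the printed regret should never exceed the printed bound, whatever sequence is substituted for `outcomes`.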

In this paper we are concerned with problems in which the forecaster has access neither to 
the outcomes $J_t$ nor to the rewards $r(i, J_t)$. The information available to the forecaster at each round 
is called the feedback signal. These feedback signals may depend on the outcomes $J_t$ only or on the 
action-outcome pairs $(I_t, J_t)$ and may be deterministic or drawn at random. In the simplest case when 
the feedback signal is deterministic, the information available to the forecaster is $s_t = h(I_t, J_t)$, given by a 
fixed (and known) deterministic feedback function $h : \{1, \dots, N\} \times \{1, \dots, M\} \to \mathcal{S}$ where $\mathcal{S}$ is the finite 

Randomized prediction under imperfect monitoring 

Parameters: number N of actions, number M of outcomes, reward function r, random feedback 
function H, number n of rounds. 

For each round $t = 1, 2, \dots, n$, 

(i) the environment chooses the next outcome $J_t \in \{1, \dots, M\}$ without revealing it; 

(ii) the forecaster chooses a probability distribution $p_t$ over the set of $N$ actions and draws an 
action $I_t \in \{1, \dots, N\}$ according to this distribution; 

(iii) the forecaster receives reward $r(I_t, J_t)$ and each action $i$ gets reward $r(i, J_t)$, but none of 
these values is revealed to the forecaster; 

(iv) a feedback signal $s_t$ drawn at random according to $H(I_t, J_t)$ is revealed to the forecaster. 



Figure 1: The game of randomized prediction under imperfect monitoring 

set of signals. In the most general case, the feedback signal is governed by a random feedback function of 
the form $H : \{1, \dots, N\} \times \{1, \dots, M\} \to \mathcal{P}(\mathcal{S})$ where $\mathcal{P}(\mathcal{S})$ is the set of probability distributions over the 
signals. The received feedback signal $s_t$ is then drawn at random according to the probability distribution 
$H(I_t, J_t)$ by using an external independent randomization. 

To make notation uniform throughout the paper, we identify a deterministic feedback function 
$h : \{1, \dots, N\} \times \{1, \dots, M\} \to \mathcal{S}$ with the random feedback function $H : \{1, \dots, N\} \times \{1, \dots, M\} \to \mathcal{P}(\mathcal{S})$ 
which, to each pair $(i, j)$, assigns $\delta_{h(i,j)}$, where $\delta_s$ is the probability distribution concentrated on the single 
element $s \in \mathcal{S}$. 

The sequential prediction problem under imperfect monitoring is formalized in Figure 1. 

In many interesting situations the feedback signal the forecaster receives is independent of the forecaster's 
action and only depends on the outcome, that is, for all $j = 1, \dots, M$, $H(\cdot, j)$ is constant. In 
other words, $H$ depends on the outcome $J_t$ but not on the forecaster's action $I_t$. We will see that the 
prediction problem becomes significantly simpler in this special case. To simplify notation in this case, 
we write $H(J_t) = H(I_t, J_t)$ for the feedback signal at time $t$ ($h(J_t) = h(I_t, J_t)$ in case of deterministic 
feedback signals). This setting includes the full-information case (when the outcomes $J_t$ are revealed) 
but also the case of noisy observations (when a random variable with distribution depending only on $J_t$ 
is observed); see Weissman and Merhav [53], Weissman, Merhav, and Somekh-Baruch [3D]. 

Next we describe a reasonable goal for the forecaster and define the appropriate notion of consistency. 
To this end, we introduce some notation. If $p = (p_1, \dots, p_N)$ and $q = (q_1, \dots, q_M)$ are probability 
distributions over $\{1, \dots, N\}$ and $\{1, \dots, M\}$, respectively, then, with a slight abuse of notation, we write 

\[
r(p, q) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_i \, q_j \, r(i, j)
\]

for the linear extension of the reward function $r$. We also extend linearly the random feedback function 
in its second argument: for a probability distribution $q = (q_1, \dots, q_M)$ over $\{1, \dots, M\}$, define the vector 
in $\mathcal{P}(\mathcal{S})$ 

\[
H(i, q) = \sum_{j=1}^{M} q_j \, H(i, j) , \qquad i = 1, \dots, N.
\]

Denote by $\mathcal{F}$ the convex set of all $N$-vectors $H(\cdot, q) = (H(1, q), \dots, H(N, q))$ of probability distributions 
obtained this way when $q$ varies. ($\mathcal{F} \subset \mathcal{P}(\mathcal{S})^N$ is the set of feasible distributions over the signals.) In 
the case when the feedback signals only depend on the outcome, all components of this vector are equal 
and we denote their common value by $H(q)$. We note that in the general case, the set $\mathcal{F}$ is the convex 
hull of the $M$ vectors $H(\cdot, j)$. Therefore, performing a Euclidean projection on $\mathcal{F}$ can be done efficiently 
using quadratic programming. 
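
To illustrate the projection step, here is a minimal Python sketch of a Euclidean projection onto the convex hull of finitely many vectors. It uses a Frank-Wolfe iteration with exact line search instead of a quadratic programming solver; this is our substitution for illustration, and the function name, the iteration count, and the three example vertices are assumptions, not part of the paper.

```python
def project_onto_hull(vertices, x, iterations=2000):
    """Approximate Euclidean projection of x onto conv(vertices).

    Minimizes 0.5 * ||y - x||^2 over the hull by Frank-Wolfe steps.
    """
    dim = len(x)
    y = list(vertices[0])  # start from a point of the hull
    for _ in range(iterations):
        grad = [y[k] - x[k] for k in range(dim)]
        # A linear function over a convex hull is minimized at a vertex.
        v = min(vertices, key=lambda u: sum(g * uk for g, uk in zip(grad, u)))
        d = [v[k] - y[k] for k in range(dim)]
        dd = sum(dk * dk for dk in d)
        if dd == 0.0:
            break  # already at the best vertex: y is optimal
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = -sum(g * dk for g, dk in zip(grad, d)) / dd
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break  # no descent direction towards any vertex: y is optimal
        y = [y[k] + step * d[k] for k in range(dim)]
    return y

# Hypothetical signal distributions over {a, b} for three outcomes,
# written as (P[a], P[b]); their hull is the segment from (1, 0) to (0, 1).
vertices = [(1.0, 0.0), (0.7, 0.3), (0.0, 1.0)]
print(project_onto_hull(vertices, (0.3, 0.3)))
print(project_onto_hull(vertices, (1.2, 0.1)))
```

Frank-Wolfe only converges at a sublinear rate in general, so a dedicated quadratic programming routine would be the better choice when high accuracy is needed.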

To each probability distribution $p$ over $\{1, \dots, N\}$ and probability distribution $\Delta \in \mathcal{F}$, we may assign 
the quantity 

\[
\rho(p, \Delta) = \min_{q \,:\, H(\cdot, q) = \Delta} r(p, q) ,
\]

which is the reward guaranteed by the mixed action $p$ of the forecaster against any distribution of the 
outcomes that induces the given distribution of feedback signals $\Delta$. Note that $\rho \in [0, 1]$ and that $\rho$ is 
concave in $p$ (since it is an infimum of linear functions; since this infimum is taken on a compact convex set, the 
infimum is indeed a minimum). Finally, $\rho$ is also convex in $\Delta$ as the condition defining the minimum is 
linear in $\Delta$. 

To define the goal of the forecaster, let $q_n$ denote the empirical distribution of the outcomes $J_1, \dots, J_n$ 
up to round $n$. This distribution may be unknown to the forecaster since the forecaster observes the 
signals rather than the outcomes. The best the forecaster can hope for is an average reward close 
to $\max_p \rho(p, H(\cdot, q_n))$. Indeed, even if $H(\cdot, q_n)$ were known beforehand, the maximal expected reward 
for the forecaster would be $\max_p \rho(p, H(\cdot, q_n))$, simply because without any additional information the 
forecaster cannot hope to do better than against the worst element which is equivalent to $q_n$ as far as the 
signals are concerned. 

Based on this argument, the (per-round) regret $R_n$ is defined as the difference between the 
target quantity described above and the obtained average reward, that is, 

\[
R_n = \max_p \rho(p, H(\cdot, q_n)) - \frac{1}{n} \sum_{t=1}^{n} r(I_t, J_t) .
\]

Rustichini [26] proves the existence of a forecasting strategy whose per-round regret is guaranteed to 
satisfy $\limsup_{n \to \infty} R_n \le 0$ with probability one, for all possible imperfect monitoring problems. 

Rustichini's proof is not constructive but in several special cases constructive and computationally 
efficient prediction algorithms have been proposed. Among the partial solutions proposed so far, we 
mention Piccolboni and Schindelhauer [25] and Cesa-Bianchi, Lugosi, and Stoltz [11] who study the case 
when 

\[
\max_p \rho(p, H(\cdot, q_n)) = \max_{i=1,\dots,N} r(i, q_n) = \max_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^{n} r(i, J_t) .
\]

In this case strategies with a vanishing per-round regret are called Hannan consistent. In such cases the 
feedback is sufficiently rich so that one may achieve the same asymptotic reward as in the full information 
case, although the rate of convergence may be slower. This case turns out to be considerably simpler to 
handle than the general problem and computationally tractable explicit algorithms have been derived. 
Also, it is shown in [11] that in this case it is possible to construct strategies whose regret decreases 
at a rate of $n^{-1/3}$ (with high probability) and that this rate of convergence cannot be improved in 
general. (Note that Hannan consistency is achievable, for example, in the adversarial multi-armed bandit 
problem, see Remark B.1 in the Appendix.) Mannor and Shimkin [22] construct an approachability-based 
algorithm with vanishing regret for the special case where the feedback signals depend only on the 
outcome. In addition, Mannor and Shimkin discuss the more general case of feedback signals that depend 
on both the action and the outcome and provide an algorithm that attains a relaxed goal compared to 
the one attained in this work. 

The following example demonstrates the structure of the model. 

Example 2.1 Consider the simple game where $N = 2$, $M = 3$, $\mathcal{S} = \{a, b\}$, and the reward and feedback 
functions are as follows. The reward function is described by the matrix 

\[
\begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 1/2 \end{bmatrix} .
\]

To identify the possible distributions of the feedback signals we need to specify some elements of $\mathcal{P}(\mathcal{S})$. We 
describe such a member of $\mathcal{P}(\mathcal{S})$ by the probability of observing $a$. The feedback function is parameterized 
by some $\varepsilon > 0$ and is then given by 

\[
\begin{bmatrix} 1 & 1-\varepsilon & 0 \\ 1 & 1-\varepsilon & 0 \end{bmatrix} .
\]

In words, outcome 1 leads to a deterministic feedback signal of $a$, outcome 3 leads to a deterministic 
feedback signal of $b$, and outcome 2 leads to a feedback signal of $a$ with probability $1 - \varepsilon$ and $b$ with 
probability $\varepsilon$. Note that the feedback signals depend only on the outcome and not on the action taken. 
We recall that $\Delta$, as a member of $\mathcal{P}(\mathcal{S})$, is identified with the probability of observing the feedback signal 
$a$ and it follows that $\mathcal{F}$ is the interval $[0, 1]$. We now compute the function $\rho$. Letting $p$ denote the 
probability of selecting the first action (i.e., $p = (p, 1 - p)$), we have 

\[
\rho(p, \Delta)
= \min_{q \,:\, q_1 + (1-\varepsilon) q_2 = \Delta} \left( p \, q_1 + \frac{1-p}{2} \right)
= \min_{q \,:\, \varepsilon q_1 - (1-\varepsilon) q_3 = \Delta - (1-\varepsilon)} p \, q_1 + \frac{1-p}{2}
= \frac{1-p}{2} + p \, \frac{(\Delta - (1-\varepsilon))_+}{\varepsilon} ,
\]

where $(x)_+ = \max\{x, 0\}$. Optimizing over $p$, we obtain 

\[
\max_p \rho(p, \Delta) =
\begin{cases}
\dfrac{1}{2} & \text{for } \Delta \le 1 - \varepsilon/2, \\[6pt]
\dfrac{\Delta - (1-\varepsilon)}{\varepsilon} & \text{for } 1 - \varepsilon/2 \le \Delta \le 1.
\end{cases}
\]

The intuition here is that for $\Delta = 1$ there is certainty that the outcome is 1 so that an action of $p = 1$ is 
optimal. For $\Delta \le 1 - \varepsilon$ the forecaster does not know if the outcome was consistently 2 or some mixture 
of outcomes 1 and 3. By playing the second action, the forecaster can guarantee a reward of 1/2. The 
function $\Delta \mapsto \max_p \rho(p, \Delta)$ is depicted in Figure 2. 

Figure 2: The function $\Delta \mapsto \max_p \rho(p, \Delta)$ for Example 2.1 
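
The closed forms in Example 2.1 can be cross-checked numerically. The sketch below is illustrative only: the function names are ours, and it assumes the reward rows $(1, 0, 0)$ and $(1/2, 1/2, 1/2)$, the probabilities $(1, 1-\varepsilon, 0)$ of observing $a$, and $0 < \varepsilon < 1$. It evaluates the minimum that defines the worst-case reward over the vertices of the feasible set of outcome distributions and compares the result with the formulas of the example.

```python
def feasible_vertices(delta, eps):
    """Vertices of {q in the simplex : q1 + (1 - eps) * q2 = delta}.

    The feasible set is a segment, so each vertex has a zero coordinate.
    """
    candidates = [
        (delta, 0.0, 1.0 - delta),                              # q2 = 0
        (1.0 - (1.0 - delta) / eps, (1.0 - delta) / eps, 0.0),  # q3 = 0
        (0.0, delta / (1.0 - eps), 1.0 - delta / (1.0 - eps)),  # q1 = 0
    ]
    tol = 1e-12
    return [q for q in candidates if all(-tol <= c <= 1.0 + tol for c in q)]

def rho(p, delta, eps):
    """Worst-case reward of the mixed action (p, 1 - p): min of p*q1 + (1-p)/2."""
    return min(p * q[0] + (1.0 - p) / 2.0 for q in feasible_vertices(delta, eps))

def closed_form(p, delta, eps):
    """Formula of Example 2.1 for the same quantity."""
    return (1.0 - p) / 2.0 + p * max(0.0, (delta - (1.0 - eps)) / eps)

def max_rho(delta, eps, grid=100):
    """Maximum over p, approximated on a grid that contains p = 0 and p = 1."""
    return max(rho(k / grid, delta, eps) for k in range(grid + 1))

eps = 0.2
for delta in (0.5, 0.9, 0.95, 1.0):
    print(delta, max_rho(delta, eps))
```

In this example the worst-case reward is linear in $p$ for fixed $\Delta$, so the grid maximum is attained at an endpoint and is exact; for other games a finer search over $p$ would be needed.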



In this paper we construct simple and computationally efficient strategies whose regret vanishes with 
probability one. The main idea behind the forecasters we introduce in the next sections is based on 
the gradient-based strategies described, for example, in Cesa-Bianchi and Lugosi [10, Section 2.5]. Our 
forecasters use sub-gradients of concave functions. In the next section we briefly recall some basic facts 
on the existence, computation, and boundedness of these sub-gradients. 

3. Some analytical properties of $\rho$ For a concave function $f$ defined over a convex subset of $\mathbb{R}^d$, 
a vector $b(x) \in \mathbb{R}^d$ is a sub-gradient of $f$ at $x$ if $f(y) - f(x) \le b(x) \cdot (y - x)$ for all $y$ in the domain of $f$. 
We denote by $\partial f(x)$ the set of sub-gradients of $f$ at $x$, which is also known as the sub-differential. 
Sub-gradients always exist, that is, $\partial f(x)$ is non-empty, in the interior of the domain of a concave function. 
In this paper, we are interested in sub-gradients of concave functions of the form $f(\cdot) = \rho(\cdot, \hat\Delta_t)$, where 
$\hat\Delta_t$ is an observed or estimated distribution of feedback signal at round $t$. (For instance, in Section 4, 
$\hat\Delta_t = \delta_{h(J_t)}$ is observed; in the other sections, it will be estimated.) In view of the exponentially weighted 
update rules that are used below, we only evaluate these functions in the interior of the definition domain 
(the simplex). Thus, the existence of sub-gradients is ensured throughout. 

In the general case, sub-gradients may be computed efficiently by the simplex method. However, 
their computation is often even simpler, as in the case described in Section 4, that is, when one faces 
deterministic feedback signals not depending on the actions of the forecaster. Indeed, at round $t$, it 
is trivial whenever $p \mapsto \rho(p, \delta_{h(J_t)})$ is differentiable at the considered point $p_t$ since it is differentiable 
exactly at those points at which it is locally linear, and thus the gradient equals the column of the reward 
matrix corresponding to the outcome $y_t$ for which $r(p_t, y_t) = \rho(p_t, \delta_{h(J_t)})$. But because $\rho(\cdot, \delta_{h(J_t)})$ is 
concave, the Lebesgue measure of the set where it is non-differentiable equals zero. It thus suffices to 
resort to the simplex method only at these points to compute the sub-gradients. 
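
For deterministic feedback that depends only on the outcome, the computation just described can be sketched in a few lines of Python. The sketch assumes that the target function at a point mass on signal $s$ reduces to the minimum of $r(p, j)$ over the outcomes $j$ with $h(j) = s$, and it returns the reward column of a minimizing outcome. The function name and the toy game (a deterministic variant of Example 2.1) are our assumptions. At ties any minimizing column remains a valid sub-gradient here, because a minimum of linear functions is being differentiated.

```python
def rho_and_subgradient(reward, h, signal, p):
    """Value and one sub-gradient of p -> min_{j : h[j] == signal} r(p, j).

    reward[i][j] plays the role of r(i, j); h[j] is the signal of outcome j.
    """
    best_value, best_column = None, None
    for j, s in enumerate(h):
        if s != signal:
            continue  # outcome j is incompatible with the observed signal
        column = [row[j] for row in reward]          # (r(1, j), ..., r(N, j))
        value = sum(pi * c for pi, c in zip(p, column))
        if best_value is None or value < best_value:
            best_value, best_column = value, column
    return best_value, best_column

# Hypothetical game: outcomes 1 and 2 both produce signal 'a', outcome 3 gives 'b'.
reward = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
h = ['a', 'a', 'b']
print(rho_and_subgradient(reward, h, 'a', [0.4, 0.6]))
```

The returned column $g$ satisfies the sub-gradient inequality of this section: the value at any other distribution $p'$ is at most the value at $p$ plus $g \cdot (p' - p)$.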

Note that the components of the sub-gradients are always bounded by a constant that depends on 
the game parameters. This is the case since the $\rho(\cdot, \hat\Delta_t)$ are concave and continuous on a compact set 
and are therefore Lipschitz, leading to a bounded sub-gradient. In the sequel, we denote by $K$ the value 
$\sup_p \sup_\Delta \sup_{b \in \partial\rho(p, \Delta)} \|b\|_\infty$, where $\partial\rho(p, \Delta)$ denotes the sub-differential at $p$ of the concave function $\rho(\cdot, \Delta)$ 
with $\Delta$ fixed. This constant depends on the specific parameters of the game. Since the parameters of 
the game are supposed to be known to the forecaster, in principle, the forecaster can compute the value 
of $K$. In any case, the value of $K$ can be bounded by the supremum norm of the payoff function as the 
following lemma asserts. 

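Lemma 3.1 The constant $K$ satisfies $K \le 1$. 

PROOF. Fix $\Delta$ and consider $Z_\Delta = \{q : H(\cdot, q) = \Delta\}$. Define $\varphi : (p, q) \in \mathbb{R}^N \times Z_\Delta \mapsto \varphi(p, q) \in \mathbb{R}$ 
as the linear extension-restriction of $r$ to $\mathbb{R}^N \times Z_\Delta$, that is, $\varphi(p, q) = \sum_{i,j} p_i \, q_j \, r(i, j)$. Further, let 
$Z_\Delta(p) = \{q \in Z_\Delta : \varphi(p, q) = \min_{q' \in Z_\Delta} \varphi(p, q')\}$. It follows that under our notation, for any probability 
distribution $p$, one has $\rho(p, \Delta) = \min_{q \in Z_\Delta} \varphi(p, q)$. Now, from Danskin's theorem (see, e.g., Bertsekas 
[5]) we have that the sub-differential satisfies 

\[
\partial\rho(p, \Delta) = \operatorname{conv}\left\{ \frac{\partial \varphi(p, z)}{\partial p} : z \in Z_\Delta(p) \right\} ,
\]

where $\operatorname{conv}(A)$ denotes the convex hull of a set $A$. Since $r(i, j) \in [0, 1]$, it follows that $\|\partial\varphi(p, z)/\partial p\|_\infty \le 1$ 
for all $z \in Z_\Delta$. Since the convex hull does not increase the infinity norm, the result follows. □ 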

Remark 3.1 The constant $K$ for the game described in Example 2.1 is 1/2. However, the gradient of the 
function $\max_p \rho(p, \Delta)$ as a function of $\Delta$ is $1/\varepsilon$. This happens because the $p$ that attains the maximum 
changes rapidly in the interval $[1 - \varepsilon/2, 1]$. We further note that $K$ may be much smaller than 1. Since 
our regret bounds below depend on $K$ linearly, having a tighter bound on $K$ can lead to considerable 
convergence rate speedup; see Remark 4.1. 

4. Deterministic feedback signals only depending on outcome We start with the simplest 
case when the feedback signal is deterministic and does not depend on the action $I_t$ of the forecaster. 
In other words, after making the prediction at time $t$, the forecaster observes $h(J_t)$. This simplifying 
assumption may be naturally satisfied in applications in which the forecaster's decisions do not affect the 
environment. 

In this case, we group the outcomes according to the deterministic feedback signal they are associated 
with. Each signal $s$ is uniquely associated with a group of outcomes. This situation is very similar to the case 
of full monitoring except that rewards are measured by $\rho$ and not by $r$. This does not pose a problem 
since $r$ is lower bounded by $\rho$ in the sense that for all $p$ and $j$, 

\[
r(p, j) \ge \rho(p, \delta_{h(j)}) .
\]

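As mentioned in the previous section, we introduce a forecaster based on the sub-gradients of $\rho(\cdot, \delta_{h(J_t)})$, 
$t = 1, 2, \dots$. The forecaster requires a tuning parameter $\eta > 0$. The $i$-th component of $p_t$ is 

\[
p_{i,t} = \frac{\exp\left( \eta \sum_{s=1}^{t-1} \big( \tilde r(p_s, \delta_{h(J_s)}) \big)_i \right)}{\sum_{j=1}^{N} \exp\left( \eta \sum_{s=1}^{t-1} \big( \tilde r(p_s, \delta_{h(J_s)}) \big)_j \right)} ,
\]

where $(\tilde r(p_s, \delta_{h(J_s)}))_i$ is the $i$-th component of any sub-gradient $\tilde r(p_s, \delta_{h(J_s)}) \in \partial\rho(p_s, \delta_{h(J_s)})$ of the 
concave function $\rho(\cdot, \delta_{h(J_s)})$. This forecaster is inspired by a gradient-based predictor introduced by 
Kivinen and Warmuth [20]. 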
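
The forecaster of this section can be sketched as follows. This is a hedged illustration in Python rather than the authors' implementation: the helper `min_column`, the function `run_forecaster`, the toy game, and the fixed seed are all assumptions. Sub-gradients are taken to be reward columns of outcomes that are compatible with the observed signal and minimize $r(p_t, j)$, which is valid for deterministic feedback depending only on the outcome.

```python
import math
import random

def min_column(reward, h, signal, p):
    """Reward column of an outcome j with h[j] == signal that minimizes r(p, j)."""
    group = [j for j, s in enumerate(h) if s == signal]
    j_star = min(group, key=lambda j: sum(pi * row[j] for pi, row in zip(p, reward)))
    return [row[j_star] for row in reward]

def run_forecaster(reward, h, outcomes, eta, seed=0):
    """Exponentially weighted forecaster driven by sub-gradients.

    Returns a list of (p_t, I_t, s_t); the outcomes J_t are used only to
    produce the signals h[J_t], never shown to the forecaster directly.
    """
    rng = random.Random(seed)
    n_actions = len(reward)
    cum = [0.0] * n_actions                   # cumulative sub-gradient components
    history = []
    for j in outcomes:
        shift = max(cum)
        w = [math.exp(eta * (c - shift)) for c in cum]
        total = sum(w)
        p = [x / total for x in w]
        action = rng.choices(range(n_actions), weights=p)[0]
        signal = h[j]                         # the only feedback that is observed
        g = min_column(reward, h, signal, p)  # sub-gradient at p_t
        cum = [c + gi for c, gi in zip(cum, g)]
        history.append((p, action, signal))
    return history

# Hypothetical game: outcomes 1 and 2 share signal 'a', outcome 3 gives 'b'.
reward = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
h = ['a', 'a', 'b']
history = run_forecaster(reward, h, [0] * 200, eta=0.5)
print(history[-1][0])
```

In this toy run the signal never distinguishes outcome 1 from outcome 2, so the forecaster drifts towards the second action, which guarantees reward 1/2 whichever of the two outcomes actually occurred.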

The regret is bounded as follows. Note that the following bound and the considered forecaster coincide 
with those of (1) in case of perfect monitoring. (In that case, $\rho(\cdot, \delta_{h(j)}) = r(\cdot, j)$ and the sub-gradients are 
given by $r$.) 

Proposition 4.1 For all $\eta > 0$, for all strategies of the environment, for all $\delta > 0$, the above strategy of 
the forecaster ensures that, with probability at least $1 - \delta$, 

\[
R_n \le \frac{\ln N}{\eta n} + \frac{K^2 \eta}{2} + \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} ,
\]

where $K$ is the bound on the sub-gradients considered above. In particular, choosing $\eta \sim \sqrt{(\ln N)/n}$ 
yields $R_n = O\big(n^{-1/2} \sqrt{\ln(N/\delta)}\big)$. 



Remark 4.1 The optimal choice of $\eta$ in the upper bound is $K^{-1}\sqrt{2(\ln N)/n}$, which depends on the parameters 
$K$ and $n$. While the bound $K \le 1$ is available, this bound might be loose. Sometimes the forecaster 
does not necessarily know in advance the number of prediction rounds and/or the value of $K$ may be 
difficult to compute. In such cases one may estimate on-line both the number of time rounds and $K$, 
using the techniques of Auer, Cesa-Bianchi, and Gentile {If and Cesa-Bianchi, Mansour, and Stoltz [12], 
as follows. Writing 

\[
K_t = \max_{s \le t-1} \big\| \tilde r(p_s, \delta_{h(J_s)}) \big\|_\infty ,
\]

and introducing a round-dependent choice of the tuning parameter $\eta = \eta_t = C K_t^{-1} \sqrt{(\ln N)/t}$ for a properly 
chosen constant $C$, one may prove a regret bound that is a constant multiple of $K_n \sqrt{\ln(N/\delta)/n}$ (that holds 
with probability at least $1 - \delta$). Since the proof of this is a straightforward combination of the techniques 
of the above-mentioned papers and our proof, the details are omitted. 
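
As a worked check of the tuning discussed in this remark, the following derivation minimizes the part of the bound that depends on the tuning parameter. It assumes that this part is $\ln N/(\eta n) + K^2\eta/2$, as in the last step of the proof below; the symbols $g$ and $\eta^\ast$ are introduced here for illustration only.

```latex
\begin{align*}
g(\eta) &= \frac{\ln N}{\eta n} + \frac{K^2 \eta}{2},
\qquad
g'(\eta) = -\frac{\ln N}{\eta^2 n} + \frac{K^2}{2} = 0
\;\Longrightarrow\;
\eta^\ast = \frac{1}{K} \sqrt{\frac{2 \ln N}{n}}, \\
g(\eta^\ast) &= K \sqrt{\frac{\ln N}{2n}} + K \sqrt{\frac{\ln N}{2n}}
= K \sqrt{\frac{2 \ln N}{n}} .
\end{align*}
```

The optimized bound is therefore linear in $K$, which is why a tighter estimate of $K$ translates directly into a smaller constant in the regret.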



PROOF. Note that since the feedback signals are deterministic, $H(q_n)$ takes the simple form 
$H(q_n) = \frac{1}{n} \sum_{t=1}^{n} \delta_{h(J_t)}$. Now, for any $p$, 

\begin{align*}
n \rho(p, H(q_n)) - \sum_{t=1}^{n} r(p_t, J_t)
&\le n \rho(p, H(q_n)) - \sum_{t=1}^{n} \rho(p_t, \delta_{h(J_t)}) && \text{(by the lower bound on $r$ in terms of $\rho$)} \\
&\le \sum_{t=1}^{n} \big( \rho(p, \delta_{h(J_t)}) - \rho(p_t, \delta_{h(J_t)}) \big) && \text{(by convexity of $\rho$ in the second argument)} \\
&\le \sum_{t=1}^{n} \tilde r(p_t, \delta_{h(J_t)}) \cdot (p - p_t) && \text{(by concavity of $\rho$ in the first argument)} \\
&\le \frac{\ln N}{\eta} + \frac{n K^2 \eta}{2} && \text{(by (1), after proper rescaling),}
\end{align*}

where at the last step we used the fact that the forecaster is just the exponentially weighted average 
predictor based on the rewards $(\tilde r(p_s, \delta_{h(J_s)}))_i$ and that all these reward vectors have components 
between $-K$ and $K$. The proof is concluded by the Hoeffding-Azuma inequality, which ensures that, with 
probability at least $1 - \delta$, 

\[
\sum_{t=1}^{n} r(I_t, J_t) \ge \sum_{t=1}^{n} r(p_t, J_t) - \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} . \qquad (2)
\]

□ 



5. Random feedback signals depending only on the outcome Next we consider the case when 
the feedback signals do not depend on the forecaster's actions, but, at time $t$, the signal $s_t$ is drawn at 
random according to the distribution $H(J_t)$. In this case the forecaster does not have direct access to 

\[
H(q_n) = \frac{1}{n} \sum_{t=1}^{n} H(J_t)
\]

anymore, but only observes the realizations $s_t$ drawn at random according to $H(J_t)$. In order to overcome 
this problem, we group together several consecutive time rounds (say, $m$ of them) and estimate the 
probability distributions according to which the signals have been drawn. 

Parameters: Integer $m \ge 1$, real number $\eta > 0$. 

Initialization: $w^0 = (1, \dots, 1)$. 

For each round $t = 1, 2, \dots$ 

(i) If $bm + 1 \le t \le (b+1)m$ for some integer $b$, choose the distribution $p_t = p^b$ given by 

\[
p_{k,t} = p^b_k = \frac{w^b_k}{\sum_{j=1}^{N} w^b_j}
\]

and draw an action $I_t$ from $\{1, \dots, N\}$ according to it; 

(ii) if $t = (b+1)m$ for some integer $b$, perform the update 

\[
w^{b+1}_k = w^b_k \, e^{\eta \, (\tilde r(p^b, \hat\Delta^b))_k} \qquad \text{for each } k = 1, \dots, N,
\]

where for all $\Delta$, $\tilde r(\cdot, \Delta)$ is a sub-gradient of $\rho(\cdot, \Delta)$ and $\hat\Delta^b$ is defined in (3). 

Figure 3: The forecaster for random feedback signals depending only on the outcome. 



To this end, denote by $\Pi$ the Euclidean projection onto $\mathcal{F}$ (since the feedback signals depend only on 
the outcome we may now view the set $\mathcal{F}$ of feasible distributions over the signals as a subset of $\mathcal{P}(\mathcal{S})$, 
the latter being identified with a subset of $\mathbb{R}^{|\mathcal{S}|}$ in a natural way). Let $m$, $1 \le m \le n$, be a parameter of 
the algorithm. For $b = 0, 1, \dots$, we denote 

\[
\hat\Delta^b = \Pi\left( \frac{1}{m} \sum_{t=bm+1}^{(b+1)m} \delta_{s_t} \right) . \qquad (3)
\]

For the sake of the analysis, we also introduce 

\[
\Delta^b = \frac{1}{m} \sum_{t=bm+1}^{(b+1)m} H(J_t) .
\]

The proposed strategy is described in Figure 3. Observe that the practical implementation of the forecaster 
only requires the computation of (sub)gradients and of $\ell^2$ projections, which can be done in polynomial 
time. The next theorem bounds the regret of the strategy, which is of the order of $n^{-1/4}\sqrt{\log n}$. 
The price we pay for having to estimate the distribution is thus a deteriorated rate of convergence (from 
the $O(n^{-1/2})$ obtained in the case of deterministic feedback signals). We do not know whether this rate 
can be improved significantly as we do not know of any nontrivial lower bound in this case. 
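
The block structure of this strategy can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `block_forecaster`, `estimate_binary`, and `subgradient_example` are hypothetical names, the estimate and sub-gradient helpers are specialized to the game of Example 2.1 (two signals, a distribution identified with the probability of signal `a`, so that the projection is a clipping to the unit interval), and the signals are passed in as a fixed sequence, which is legitimate only because they do not depend on the forecaster's actions.

```python
import math

def block_forecaster(n_actions, m, eta, signals, estimate, subgradient):
    """Yield p_t for each round; the weights change only at the end of a block.

    estimate(block) -> projected estimate of the signal distribution in a block
    subgradient(p, d) -> a sub-gradient at p of the target function with d fixed
    """
    log_w = [0.0] * n_actions          # log-weights, all weights start at 1
    block = []
    for s in signals:
        shift = max(log_w)
        w = [math.exp(v - shift) for v in log_w]
        total = sum(w)
        p = [x / total for x in w]
        yield p                        # same p for every round of the block
        block.append(s)
        if len(block) == m:            # block complete: multiplicative update
            g = subgradient(p, estimate(block))
            log_w = [v + eta * gk for v, gk in zip(log_w, g)]
            block = []

EPS = 0.5  # assumed value of the parameter of Example 2.1

def estimate_binary(block):
    """Empirical frequency of signal 'a', clipped to the feasible interval [0, 1]."""
    freq = sum(1 for s in block if s == 'a') / len(block)
    return max(0.0, min(1.0, freq))

def subgradient_example(p, d):
    """Rewards of both actions against a worst-case outcome distribution."""
    q1 = max(0.0, (d - (1.0 - EPS)) / EPS)  # smallest feasible weight of outcome 1
    return [q1, 0.5]

ps = list(block_forecaster(2, 5, 0.3, ['a'] * 50, estimate_binary, subgradient_example))
print(ps[0], ps[5], ps[-1])
```

With signal `a` observed in every round the estimate equals 1, outcome 1 is the only compatible explanation, and the weights move towards the first action.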

Theorem 5.1 For all integers $m \ge 1$, for all $\eta > 0$, and for all $\delta > 0$, the regret for any strategy of the 
environment is bounded, with probability at least $1 - (n/m + 1)\delta$, by 

\[
R_n \le 2L \frac{1}{\sqrt{m}} \sqrt{2 \ln \frac{2}{\delta}} + \frac{m \ln N}{n \eta} + \frac{K^2 \eta}{2} + \frac{m}{n} + \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} ,
\]

where $K \le 1$ and $L$ are constants that depend only on the parameters of the game. The choices $m = \lceil \sqrt{n} \rceil$ 
and $\eta \sim \sqrt{(m \ln N)/n}$ imply $R_n = O\big(n^{-1/4} \sqrt{\ln(nN/\delta)}\big)$ with probability of at least $1 - \delta$. 
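
To see where the $n^{-1/4}$ rate comes from, the following worked substitution plugs the suggested tuning into the bound. It assumes that the bound of Theorem 5.1 consists of the five terms shown on the first line, and it ignores the rounding of $\sqrt{n}$ to an integer.

```latex
\begin{align*}
R_n &\le \frac{2L}{\sqrt{m}} \sqrt{2 \ln \frac{2}{\delta}} + \frac{m \ln N}{n \eta}
      + \frac{K^2 \eta}{2} + \frac{m}{n} + \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} , \\
\eta = \sqrt{\frac{m \ln N}{n}}
  &\;\Longrightarrow\;
  \frac{m \ln N}{n \eta} + \frac{K^2 \eta}{2}
  = \left( 1 + \frac{K^2}{2} \right) \sqrt{\frac{m \ln N}{n}} , \\
m = \sqrt{n}
  &\;\Longrightarrow\;
  \sqrt{\frac{m \ln N}{n}} = n^{-1/4} \sqrt{\ln N} , \qquad
  \frac{1}{\sqrt{m}} = n^{-1/4} , \qquad
  \frac{m}{n} = n^{-1/2} .
\end{align*}
```

Every term is then of order at most $n^{-1/4}$ up to logarithmic factors. Replacing $\delta$ by $\delta/(n/m + 1)$, so that the overall failure probability is $\delta$, introduces the additional factor of $n$ inside the logarithm.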

Remark 5.1 Here again, $K$ and $L$ may, in principle, be computed or bounded (see Lemma 3.1 and
Remark A.1) by the forecaster. If the horizon $n$ is known in advance (as it is assumed in this paper),
the values of $\eta$ and $m$ may be chosen to optimize the upper bound for the regret. Observe that while
one always has $K \le 1$, the value of $L$ (i.e., the Lipschitz constant of $\rho$ in its second argument) can be
arbitrarily large, see Example 2.1. If the horizon $n$ is unknown at the start of the game, the situation
is not as simple as in Section 4 (see Remark 4.1), because now a time-dependent choice of $\eta$ needs to
be accompanied by an adaptive choice of the parameter $m$ as well. A simple, though not very attractive,
solution is the so-called "doubling trick" (see, e.g., [10, p. 17]). According to this solution, time is divided
into periods of exponentially growing length and in each period the forecaster is used as if the horizon
were the length of the current period. At the end of each period the forecaster is reset and started again
with new parameter values. It is easy to see that this forecaster achieves the same regret bounds, up to a
constant multiplier. We believe that a smoother solution should also work (as in Remark 4.1). Since this
seems like a technical endeavor we do not pursue this issue further.
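The period structure of the doubling trick can be sketched as follows. The function name `doubling_periods` is hypothetical; the sketch only produces the schedule, and the forecaster itself would be restarted at the beginning of each period with parameters tuned for a horizon equal to the period length.

```python
def doubling_periods(total_rounds):
    """Split rounds 1..total_rounds into periods of nominal lengths 1, 2, 4, ...

    Returns a list of (first_round, last_round, nominal_length); the last
    period is cut short if the game ends before it is complete.
    """
    periods = []
    start, length = 1, 1
    while start <= total_rounds:
        end = min(start + length - 1, total_rounds)
        periods.append((start, end, length))
        start, length = end + 1, 2 * length
    return periods


print(doubling_periods(10))
# [(1, 1, 1), (2, 3, 2), (4, 7, 4), (8, 10, 8)]
```

Within a period of nominal length $\ell$, the forecaster would be run with $m$ and $\eta$ chosen as if $n = \ell$.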

Proof. We start by grouping time rounds $m$ by $m$. For simplicity, we assume that $n = (B + 1)m$
for some integer $B$; if this is not the case, we consider the lower integer part of $n/m$ and bound the regret
suffered in the last at most $m - 1$ rounds by $m$ (this accounts for the $m/n$ term in the bound). For all $p$,

$$\begin{aligned}
n\,\rho\bigl(p, H(\bar q_n)\bigr) - \sum_{t=1}^{n} r(p_t, J_t)
&= n\,\rho\bigl(p, H(\bar q_n)\bigr) - \sum_{b=0}^{B} m\, r\Bigl(p^b, \frac{1}{m}\sum_{t=bm+1}^{(b+1)m} \delta_{J_t}\Bigr) \\
&\le \sum_{b=0}^{B} \Bigl( m\,\rho(p, \Delta^b) - m\, r\Bigl(p^b, \frac{1}{m}\sum_{t=bm+1}^{(b+1)m} \delta_{J_t}\Bigr) \Bigr) \\
&\le m \sum_{b=0}^{B} \bigl( \rho(p, \Delta^b) - \rho(p^b, \Delta^b) \bigr),
\end{aligned}$$

where we used the definition of the algorithm, convexity of $\rho$ in its second argument, and finally, the
definition of $\rho$ as a minimum. We proceed by estimating $\Delta^b$ by $\hat\Delta^b$. By a version of the Hoeffding-Azuma
inequality for sums of Hilbert space-valued martingale differences proved by Chen and White [13, Lemma
3.2], and since the $\ell_2$ projection can only help, for all $b$, with probability at least $1 - \delta$,

$$\bigl\| \hat\Delta^b - \Delta^b \bigr\|_2 \le \sqrt{\frac{2 \ln\frac{2}{\delta}}{m}}.$$

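By Proposition A.1, $\rho$ is uniformly Lipschitz in its second argument (with constant $L$), and therefore we
may further bound as follows. With probability $1 - (B + 1)\delta$,

$$\begin{aligned}
m \sum_{b=0}^{B} \bigl( \rho(p, \Delta^b) - \rho(p^b, \Delta^b) \bigr)
&\le m \sum_{b=0}^{B} \Bigl( \rho(p, \hat\Delta^b) - \rho(p^b, \hat\Delta^b) + 2L\sqrt{\frac{2\ln\frac{2}{\delta}}{m}} \Bigr) \\
&= m \sum_{b=0}^{B} \bigl( \rho(p, \hat\Delta^b) - \rho(p^b, \hat\Delta^b) \bigr) + 2L(B + 1)\sqrt{2m \ln\frac{2}{\delta}}.
\end{aligned}$$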

The term containing $(B + 1)\sqrt{m} = n/\sqrt{m}$ is the first term in the upper bound. The remaining part is
bounded by using the same slope inequality argument as in the previous section (recall that $\tilde r$ denotes a
sub-gradient),

$$\begin{aligned}
m \sum_{b=0}^{B} \bigl( \rho(p, \hat\Delta^b) - \rho(p^b, \hat\Delta^b) \bigr)
&\le m \sum_{b=0}^{B} \tilde r(p^b, \hat\Delta^b) \cdot (p - p^b) \\
&\le m \Bigl( \frac{\ln N}{\eta} + \frac{(B + 1) K^2 \eta}{2} \Bigr) = \frac{m \ln N}{\eta} + \frac{n K^2 \eta}{2},
\end{aligned}$$

where we used Theorem [1] and the boundedness of the function $\tilde r$ between $-K$ and $K$. The proof is
concluded by the Hoeffding-Azuma inequality which, as in (2), gives the final term in the bound. The
union bound indicates that the obtained bound holds with probability at least $1 - (B + 2)\delta \ge 1 - (n/m + 1)\delta$.
□

6. Random feedback signals depending on action-outcome pair We now turn to the general
case, where the feedback signals are random and depend on the action-outcome pairs $(I_t, J_t)$. The key
is, again, to exhibit efficient estimators of the (unobserved) $H(\cdot, \bar q_n)$.

Denote by $\Pi$ the projection, in the Euclidean distance, onto $\mathcal{F}$ (where $\mathcal{F}$, as a subset of $(\mathcal{P}(\mathcal{S}))^N$, is
identified with a subset of $\mathbb{R}^{|\mathcal{S}| N}$). For $b = 0, 1, \ldots$, denote

$$\hat\Delta^b = \Pi\left( \frac{1}{m} \sum_{t=bm+1}^{(b+1)m} \bigl[ \hat h_{i,t} \bigr]_{i=1,\ldots,N} \right), \tag{4}$$

where the distribution $H(i, J_t)$ of the random signal $S_t$ received by action $i$ at round $t$ is estimated by

$$\hat h_{i,t} = \frac{\delta_{S_t}}{p_{i,t}}\, \mathbb{1}_{I_t = i}.$$

Parameters: Integer $m \ge 1$, real numbers $\eta, \gamma > 0$.
Initialization: $w^0 = (1, \ldots, 1)$.

For each round $t = 1, 2, \ldots$

(i) if $bm + 1 \le t \le (b + 1)m$ for some integer $b$, choose the distribution $p_t = p^b = (1 - \gamma)\tilde p^b + \gamma u$,
where $\tilde p^b$ is defined component-wise as

$$\tilde p^b_k = \frac{w^b_k}{\sum_{j=1}^{N} w^b_j}$$

and $u$ denotes the uniform distribution, $u = (1/N, \ldots, 1/N)$;

(ii) draw an action $I_t$ from $\{1, \ldots, N\}$ according to $p_t$;

(iii) if $t = (b + 1)m$ for some integer $b$, perform the update

$$w^{b+1}_k = w^b_k\, e^{\eta\, (\tilde r(p^b, \hat\Delta^b))_k} \quad \text{for each } k = 1, \ldots, N,$$

where for all $\Delta \in \mathcal{F}$, $\tilde r(\cdot, \Delta)$ is a sub-gradient of $\rho(\cdot, \Delta)$ and $\hat\Delta^b$ is defined in (4).

Figure 4: The forecaster for random feedback signals depending on action-outcome pair.
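The block structure of the forecaster of Figure 4 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the sub-gradient and the feedback mechanism are passed in as hypothetical callbacks (`subgradient`, `observe`), and the projection and the exact form of the block estimate are left to the `subgradient` callback, which receives the raw importance-weighted records of the block.

```python
import math
import random


def run_block_forecaster(n, m, eta, gamma, N, subgradient, observe, rng=random):
    """Block-wise exponentially weighted forecaster with uniform mixing (sketch).

    subgradient(p, records) -> list of N reals; hypothetical callback standing
        in for the sub-gradient evaluated at the block estimate.
    observe(t, action) -> signal; hypothetical callback for the feedback.
    Returns the list of mixed distributions played, one per complete block.
    """
    weights = [1.0] * N
    played = []
    for b in range(n // m):
        total = sum(weights)
        p_tilde = [w / total for w in weights]
        # Mixing with the uniform distribution keeps every p_i >= gamma / N.
        p = [(1 - gamma) * q + gamma / N for q in p_tilde]
        played.append(p)
        records = []
        for t in range(b * m + 1, (b + 1) * m + 1):
            action = rng.choices(range(N), weights=p)[0]
            signal = observe(t, action)
            # Importance-weighted record: (action, signal, 1 / p_action).
            records.append((action, signal, 1.0 / p[action]))
        # The weights are updated only once, at the end of the block.
        grad = subgradient(p, records)
        weights = [w * math.exp(eta * g) for w, g in zip(weights, grad)]
    return played


played = run_block_forecaster(
    n=20, m=5, eta=0.5, gamma=0.1, N=2,
    subgradient=lambda p, records: [1.0, 0.0],  # toy constant sub-gradient
    observe=lambda t, action: 0,                # toy constant signal
    rng=random.Random(0),
)
print(len(played), played[-1])
```

With the toy constant sub-gradient, the weight of the first action grows from block to block, while the mixing step keeps the probability of the second action at least $\gamma/N$.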

(This form of estimators is reminiscent of those presented, e.g., in [1, 25, 11].) We prove that the $\hat h_{i,t}$
are conditionally unbiased estimators. Denote by $\mathbb{E}_t$ the conditional expectation with respect to the
information available to the forecaster at the beginning of round $t$. This conditioning fixes the values of
$p_t$ and $J_t$. Thus,

$$\mathbb{E}_t\bigl[\hat h_{i,t}\bigr] = \frac{1}{p_{i,t}}\, \mathbb{E}_t\bigl[\delta_{S_t} \mathbb{1}_{I_t = i}\bigr] = \frac{1}{p_{i,t}}\, \mathbb{E}_t\bigl[H(I_t, J_t)\, \mathbb{1}_{I_t = i}\bigr] = \frac{1}{p_{i,t}}\, H(i, J_t)\, p_{i,t} = H(i, J_t).$$
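The unbiasedness computation can be checked exactly on a small example by enumerating all (action, signal) pairs. The sketch below is illustrative only; the function name and the toy numbers are assumptions, and the outcome $J_t$ is taken as fixed, as in the conditioning above.

```python
def expected_estimator(i, p, signal_dist):
    """Exact conditional expectation of delta_S * 1{I = i} / p[i].

    p           -- forecaster's distribution over actions at this round
    signal_dist -- signal_dist[a][s] = probability of signal s when action a
                   is played against the (fixed) current outcome
    """
    expectation = [0.0] * len(signal_dist[i])
    for action, prob_action in enumerate(p):
        for s, prob_signal in enumerate(signal_dist[action]):
            if action == i:  # the indicator removes every other action
                expectation[s] += prob_action * prob_signal / p[i]
    return expectation


p = [0.2, 0.8]
signal_dist = [[0.3, 0.7], [0.6, 0.4]]
print(expected_estimator(0, p, signal_dist))  # close to [0.3, 0.7]
print(expected_estimator(1, p, signal_dist))  # close to [0.6, 0.4]
```

Up to floating-point rounding, the expectation coincides with the signal distribution of action $i$, whatever the sampling distribution $p$ (provided $p_i > 0$).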



For the sake of the analysis, introduce

$$\Delta^b = \frac{1}{m} \sum_{t=bm+1}^{(b+1)m} H(\cdot, J_t).$$

The proposed forecasting strategy is described in Figure 4. The mixing with the uniform distribution is
needed, similarly to the forecasters presented in [1, 25, 11], to ensure sufficient exploration of all actions.
Mathematically, such a mixing lower bounds the probability of pulling each action, which will turn out to be
crucial in the proof of Theorem 6.1.

Here again, the practical implementation of the forecaster only requires the computation of
(sub)gradients and of $\ell_2$ projections, which can be done efficiently. The next theorem states that the
regret in this most general case is at most of the order of $n^{-1/5}\sqrt{\log n}$. Again, we do not know whether
this bound can be improved significantly. We recall that $K$ denotes an upper bound on the infinity norm
of the sub-gradients (see Lemma 3.1). The issues concerning the tuning of the parameters considered in
the following theorem are similar to those discussed after the statement of Theorem 5.1; in particular,
the simplest way of being adaptive in all parameters is to use the "doubling trick".

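Theorem 6.1 For all integers $m \ge 1$, for all $\eta > 0$, $\gamma \in (0, 1)$, and $\delta > 0$, the regret for any strategy of
the environment is bounded, with probability at least $1 - (n/m + 1)\delta$, as

$$\begin{aligned}
R_n \le{}& 2LN\sqrt{\frac{2|\mathcal{S}|}{\gamma m} \ln\frac{2N|\mathcal{S}|}{\delta}} + 2L\,\frac{N^{3/2}\sqrt{|\mathcal{S}|}}{3\gamma m} \ln\frac{2N|\mathcal{S}|}{\delta} \\
&+ \frac{m \ln N}{n\eta} + \frac{K^2 \eta}{2} + 2K\gamma + \frac{m}{n} + \sqrt{\frac{1}{2n} \ln\frac{1}{\delta}},
\end{aligned}$$

where $L$ and $K \le 1$ are constants that depend on the parameters of the game. The choices
$m = \lceil n^{3/5} \rceil$, $\eta \sim \sqrt{(m \ln N)/n}$, and $\gamma \sim n^{-1/5}$ ensure that, with probability at least $1 - \delta$, $R_n = O\bigl(n^{-1/5} N \sqrt{|\mathcal{S}| \ln(nN|\mathcal{S}|/\delta)} + n^{-2/5} N^{3/2} \sqrt{|\mathcal{S}|}\, \ln(nN|\mathcal{S}|/\delta)\bigr)$.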
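To see why these exponents are natural, one may check numerically that the choices $m = \lceil n^{3/5} \rceil$ and $\gamma = n^{-1/5}$ put the estimation scale $1/\sqrt{\gamma m}$, the exploration cost $\gamma$, and the exponential-weights scale $\sqrt{(m \ln N)/n}$ all at order $n^{-1/5}$. The sketch below is illustrative; the function names are hypothetical and constants such as $L$, $K$, and $|\mathcal{S}|$ are ignored.

```python
import math


def tuning_6_1(n, N):
    """Choices m = ceil(n^(3/5)), gamma = n^(-1/5), eta = sqrt(m ln N / n)."""
    m = math.ceil(n ** 0.6)
    gamma = n ** -0.2
    eta = math.sqrt(m * math.log(N) / n)
    return m, gamma, eta


def leading_scales(n, N):
    """Return (estimation scale, exploration cost, weights scale), constants dropped."""
    m, gamma, eta = tuning_6_1(n, N)
    return 1 / math.sqrt(gamma * m), gamma, eta


print(leading_scales(10 ** 5, 3))  # all three values are close to 0.1 = n^(-1/5)
```

For $n = 10^5$ all three quantities are close to $0.1$, which is $n^{-1/5}$.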

Proof. The proof is similar to the one of Theorem 5.1. A difference is that we bound the accuracy
of the estimation of the $\Delta^b$ via a martingale analog of Bernstein's inequality due to Freedman [15] rather
than the Hoeffding-Azuma inequality. Also, the mixing with the uniform distribution in the first step of
the definition of the forecaster in Figure 4 needs to be handled.

We start by grouping time rounds $m$ by $m$. Assume, for simplicity, that $n = (B + 1)m$ for some integer
$B$ (this accounts, again, for the $m/n$ term in the bound). As before, we get that, for all $p$,



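$$n\,\rho\bigl(p, H(\cdot, \bar q_n)\bigr) - \sum_{t=1}^{n} r(p_t, J_t) \le m \sum_{b=0}^{B} \bigl( \rho(p, \Delta^b) - \rho(p^b, \Delta^b) \bigr) \tag{5}$$

and proceed by estimating $\Delta^b$ by $\hat\Delta^b$. Freedman's inequality [15] (see also [11, Lemma A.1]) implies
that for all $b = 0, 1, \ldots, B$, $i = 1, \ldots, N$, $s \in \mathcal{S}$, and $\delta > 0$, with probability at least $1 - \delta$,

$$\left| \Delta^b_i(s) - \frac{1}{m} \sum_{t=bm+1}^{(b+1)m} \hat h_{i,t}(s) \right| \le \sqrt{2\,\frac{N}{\gamma m} \ln\frac{2}{\delta}} + \frac{1}{3}\,\frac{N}{\gamma m} \ln\frac{2}{\delta},$$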



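where $\hat h_{i,t}(s)$ is the probability mass put on $s$ by $\hat h_{i,t}$ and $\Delta^b_i(s)$ is the $i$-th component of $\Delta^b$. This is
because the sums of the conditional variances are bounded as

$$\sum_{t=bm+1}^{(b+1)m} \operatorname{Var}_t\left( \frac{\mathbb{1}_{I_t = i,\, S_t = s}}{p_{i,t}} \right) \le \sum_{t=bm+1}^{(b+1)m} \frac{1}{p_{i,t}} \le \frac{mN}{\gamma},$$

where the second inequality follows from the lower bound $\gamma/N$ on the components of $p_t$ (ensured by the
mixing step in the definition of the forecaster). Summing (since the $\ell_2$ projection can only help), the
union bound shows that for all $b$, with probability at least $1 - \delta$,

$$\bigl\| \hat\Delta^b - \Delta^b \bigr\|_2 \le d := \sqrt{N|\mathcal{S}|} \left( \sqrt{2\,\frac{N}{\gamma m} \ln\frac{2N|\mathcal{S}|}{\delta}} + \frac{1}{3}\,\frac{N}{\gamma m} \ln\frac{2N|\mathcal{S}|}{\delta} \right).$$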



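By using uniform Lipschitzness of $\rho$ in its second argument (with constant $L$; see Proposition A.1), we
may further bound (5) with probability $1 - (B + 1)\delta$ by

$$\begin{aligned}
m \sum_{b=0}^{B} \bigl( \rho(p, \Delta^b) - \rho(p^b, \Delta^b) \bigr)
&\le m \sum_{b=0}^{B} \bigl( \rho(p, \hat\Delta^b) - \rho(p^b, \hat\Delta^b) + 2Ld \bigr) \\
&= m \sum_{b=0}^{B} \bigl( \rho(p, \hat\Delta^b) - \rho(p^b, \hat\Delta^b) \bigr) + 2m(B + 1)Ld.
\end{aligned}$$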

The terms $2m(B + 1)Ld = 2nLd$ are the first two terms in the upper bound of the theorem. The
remaining part is bounded by using the same slope inequality argument as in the previous section (recall
that $\tilde r$ denotes a sub-gradient bounded between $-K$ and $K$):

$$m \sum_{b=0}^{B} \bigl( \rho(p, \hat\Delta^b) - \rho(p^b, \hat\Delta^b) \bigr) \le m \sum_{b=0}^{B} \tilde r(p^b, \hat\Delta^b) \cdot (p - p^b).$$
Finally, we deal with the mixing with the uniform distribution:

$$\begin{aligned}
m \sum_{b=0}^{B} \tilde r(p^b, \hat\Delta^b) \cdot (p - p^b)
&\le (1 - \gamma)\, m \sum_{b=0}^{B} \tilde r(p^b, \hat\Delta^b) \cdot (p - \tilde p^b) + 2K\gamma\, m(B + 1) \\
&\qquad \text{(since, by definition, } p^b = (1 - \gamma)\tilde p^b + \gamma u\text{)} \\
&\le (1 - \gamma)\, m \Bigl( \frac{\ln N}{\eta} + \frac{(B + 1) K^2 \eta}{2} \Bigr) + 2K\gamma\, m(B + 1) \qquad \text{(by (1))} \\
&\le \frac{m \ln N}{\eta} + \frac{n K^2 \eta}{2} + 2K\gamma n.
\end{aligned}$$

The proof is concluded by the Hoeffding-Azuma inequality which, as in (2), gives the final term in the
bound. The union bound indicates that the obtained bound holds with probability at least $1 - (B + 2)\delta \ge
1 - (n/m + 1)\delta$. □

7. Deterministic feedback signals depending on action-outcome pair In this last section
we explain how in the case of deterministic feedback signals the forecaster of the previous section can be
modified so that the order of magnitude of the per-round regret improves to $n^{-1/3}$. This relies on the
linearity of $\rho$ in its second argument. In the case of random feedback signals, $\rho$ may not be linear and it
is because of this fact that we needed to group rounds into blocks of size $m$. If the feedback signals are deterministic,
such grouping is not needed and the rate $n^{-1/3}$ is obtained as a trade-off between an exploration term
($\gamma$) and the cost paid for estimating the feedback signals ($\sqrt{1/(\gamma n)}$). This rate of convergence has been
shown to be optimal in [11] even in the Hannan-consistent case. The key property is summarized in the
next technical lemma, whose proof is postponed to the appendix.
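The trade-off mentioned above can be made concrete with a short calculation. Minimizing $f(\gamma) = \gamma + c\sqrt{1/(\gamma n)}$ over $\gamma > 0$ (with $c$ a generic positive constant, introduced here only for illustration) gives $\gamma^* = (c/2)^{2/3} n^{-1/3}$, and the minimal value is itself of order $n^{-1/3}$. A minimal sketch, with a hypothetical function name:

```python
def balance_exploration(n, c=1.0):
    """Minimize f(gamma) = gamma + c * sqrt(1 / (gamma * n)) over gamma > 0.

    Setting f'(gamma) = 1 - (c / 2) * n**-0.5 * gamma**-1.5 to zero gives
    gamma* = (c / 2) ** (2/3) * n ** (-1/3).
    Returns (gamma*, f(gamma*)).
    """
    gamma_star = (c / 2) ** (2 / 3) * n ** (-1 / 3)
    value = gamma_star + c * (gamma_star * n) ** -0.5
    return gamma_star, value


_, value_n = balance_exploration(1000)
_, value_8n = balance_exploration(8000)
print(value_n / value_8n)  # close to 2, i.e. 8 ** (1/3)
```

Multiplying $n$ by 8 halves the optimal value, as expected from the $n^{-1/3}$ rate.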



Lemma 7.1 For every fixed $p$, the function $\rho(p, \cdot)$ is linear on $\mathcal{F}$.



Remark 7.1 The fact that the forecaster does not need to group rounds in the case of deterministic
feedback signals has an interesting consequence. It is easy to see from the proofs of Proposition and
Theorem 7.1, through the linearity property stated above, that the results presented there are still valid
when the payoff function $r$ may change with time (even when the environment can set it). The definition
of the regret is then generalized as

$$R_n = \max_{p}\ \min_{z_1^n :\, H(\cdot, \bar z_n) = H(\cdot, \bar q_n)}\ \frac{1}{n} \sum_{t=1}^{n} r_t(p, z_t) - \frac{1}{n} \sum_{t=1}^{n} r_t(I_t, J_t),$$

where $\bar z_n$ is the empirical distribution of the sequence of outcomes $z_1^n = (z_1, \ldots, z_n)$, and the same bounds
hold. This may model some more complex situations, including Markov decision processes. Note that
choosing time-varying reward functions was not possible with the forecasters of [25] since these relied
on a crucial structural assumption on the relation between $r$ and $h$.

Next we describe the modified forecaster. Denote by $\mathcal{H}$ the vector space generated by $\mathcal{F} \subset \mathbb{R}^{|\mathcal{S}| N}$ and
$\Pi$ the linear operator which projects any element of $\mathbb{R}^{|\mathcal{S}| N}$ onto $\mathcal{H}$. Since the $\rho(p, \cdot)$ are linear on $\mathcal{F}$, we
may extend them linearly to $\mathcal{H}$ (and with a slight abuse of notation we write $\rho$ for the extension). As a
consequence, the functions $\rho(p, \Pi(\cdot))$ defined on $\mathbb{R}^{|\mathcal{S}| N}$ are linear and coincide with the original definition
on $\mathcal{F}$. We denote by $\tilde r$ a sub-gradient (i.e., for all $\Delta \in \mathbb{R}^{|\mathcal{S}| N}$, $\tilde r(\cdot, \Delta)$ is a sub-gradient of $\rho(\cdot, \Pi(\Delta))$).

The sub-gradients are evaluated at the following points. (Recall that since the feedback signals are
deterministic, $S_t = h(I_t, J_t)$.) For $t = 1, 2, \ldots$, let

$$\hat h_t = \bigl[ \hat h_{i,t} \bigr]_{i=1,\ldots,N} = \left[ \frac{\delta_{S_t}}{p_{i,t}}\, \mathbb{1}_{I_t = i} \right]_{i=1,\ldots,N}. \tag{6}$$

The $\hat h_{i,t}$ estimate the feedback signals $H(i, J_t) = \delta_{h(i, J_t)}$ received by action $i$ at round $t$. They are still
conditionally unbiased estimators of the $h(i, J_t)$, and so is $\hat h_t$ for $H(\cdot, J_t)$. The proposed forecaster is
defined in Figure 5 and the regret bound is established in Theorem 7.1.



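Theorem 7.1 There exists a constant $C$ only depending on $r$ and $h$ such that for all $\delta > 0$, $\gamma \in (0, 1)$,
and $\eta > 0$, the regret for any strategy of the environment is bounded, with probability at least $1 - \delta$, as

$$R_n \le 2NC\sqrt{\frac{2}{n\gamma} \ln\frac{2}{\delta}} + \frac{2NC}{3\gamma n} \ln\frac{2}{\delta} + \frac{\ln N}{n\eta} + \frac{\eta K^2}{2} + 2K\gamma + \sqrt{\frac{1}{2n} \ln\frac{2}{\delta}}.$$

The choice $\gamma \sim n^{-1/3} N^{2/3}$ and $\eta \sim \sqrt{(\ln N)/n}$ ensures that, with probability at least $1 - \delta$, $R_n = O\bigl(n^{-1/3} N^{2/3} \sqrt{\ln(2/\delta)} + n^{-2/3} N^{1/3} \ln(2/\delta)\bigr)$.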
Note that here, as in Section 4 (see Remark 4.1), the tuning of the parameters can be done efficiently
on-line without resorting to the "doubling trick." The optimization of the upper bound (in both $\gamma$ and $\eta$)
requires the knowledge of $N$, $C$, $K$, and $n$. The first three parameters only depend on the game and are
known or may be calculated beforehand (the proof indicates an explicit expression for $C$ and the bound
on the sub-gradients may be computed as explained in Section 3). If $n$ and/or $K$ are unknown, their
tuning may be dealt with by taking time-dependent $\gamma_t$ and $\eta_t$.

Parameters: Real numbers $\eta, \gamma > 0$.
Initialization: $w_1 = (1, \ldots, 1)$.

For each round $t = 1, 2, \ldots$

(i) choose the distribution $p_t = (1 - \gamma)\tilde p_t + \gamma u$, where $\tilde p_t$ is defined component-wise as

$$\tilde p_{k,t} = \frac{w_{k,t}}{\sum_{j=1}^{N} w_{j,t}}$$

and $u$ denotes the uniform distribution, $u = (1/N, \ldots, 1/N)$; then draw an action $I_t$ from
$\{1, \ldots, N\}$ according to $p_t$;

(ii) perform the update

$$w_{k,t+1} = w_{k,t}\, e^{\eta\, (\tilde r(p_t, \hat h_t))_k} \quad \text{for each } k = 1, \ldots, N,$$

where $\Pi$ is the projection operator defined after the statement of Lemma 7.1, for all $\Delta \in
\mathbb{R}^{|\mathcal{S}| N}$, $\tilde r(\cdot, \Delta)$ is a sub-gradient of $\rho(\cdot, \Pi(\Delta))$, and $\hat h_t$ is defined in (6).

Figure 5: The forecaster for deterministic feedback signals depending on action-outcome pair.

Proof. The proof is similar to the one of Theorem 6.1, except that we do not have to consider
the grouping steps and that we do not apply the Hoeffding-Azuma inequality to the estimated feedback
signals but to the estimated rewards. By the bound on $r$ in terms of $\rho$ and convexity (linearity) of $\rho$ in
its second argument,

$$n\,\rho\bigl(p, H(\cdot, \bar q_n)\bigr) - \sum_{t=1}^{n} r(p_t, J_t) \le \sum_{t=1}^{n} \bigl( \rho(p, H(\cdot, J_t)) - \rho(p_t, H(\cdot, J_t)) \bigr).$$

Next we estimate

$$\rho(p, H(\cdot, J_t)) - \rho(p_t, H(\cdot, J_t)) \quad \text{by} \quad \rho(p, \Pi(\hat h_t)) - \rho(p_t, \Pi(\hat h_t)).$$

By Freedman's inequality (see, again, [11, Lemma A.1]), since $\hat h_t$ is a conditionally unbiased estimator of
$H(\cdot, J_t)$ and all functions at hand are linear in their second argument, we get that, with probability at
least $1 - \delta/2$,

$$\begin{aligned}
\sum_{t=1}^{n} \bigl( \rho(p, H(\cdot, J_t)) - \rho(p_t, H(\cdot, J_t)) \bigr)
&= \sum_{t=1}^{n} \bigl( \rho(p, \Pi(H(\cdot, J_t))) - \rho(p_t, \Pi(H(\cdot, J_t))) \bigr) \\
&\le \sum_{t=1}^{n} \bigl( \rho(p, \Pi(\hat h_t)) - \rho(p_t, \Pi(\hat h_t)) \bigr) + 2NC\sqrt{2\,\frac{n}{\gamma} \ln\frac{2}{\delta}} + \frac{2NC}{3\gamma} \ln\frac{2}{\delta},
\end{aligned}$$

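where, denoting by $e_i(\delta_{h(i,j)})$ the column vector whose $i$-th component is $\delta_{h(i,j)}$ and all other components
equal 0,

$$C = \max_{i,j}\, \max_{p}\ \rho\bigl(p, \Pi[e_i(\delta_{h(i,j)})]\bigr) < +\infty.$$

(A more precise look at the definition of $C$ shows that it is less than the maximal $\ell_1$ norm of the
barycentric coordinates of the points $\Pi[e_i(\delta_{h(i,j)})]$ with respect to the $h(\cdot, j)$.) This is because for all $t$,
the conditional variances are bounded as follows. For all $p'$,

$$\begin{aligned}
\mathbb{E}_t\Bigl[ \rho\bigl(p', \Pi(\hat h_t)\bigr)^2 \Bigr]
&= \sum_{i=1}^{N} p_{i,t}\ \rho\Bigl(p', \Pi\Bigl[e_i\Bigl(\frac{\delta_{h(i, J_t)}}{p_{i,t}}\Bigr)\Bigr]\Bigr)^2 \\
&= \sum_{i=1}^{N} \frac{1}{p_{i,t}}\ \rho\bigl(p', \Pi[e_i(\delta_{h(i, J_t)})]\bigr)^2 \le \sum_{i=1}^{N} \frac{C^2}{p_{i,t}} \le \frac{C^2 N^2}{\gamma}.
\end{aligned}$$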



The remaining part is bounded by using the same slope inequality argument as in the previous sections
(recall that $\tilde r$ denotes a sub-gradient in the first argument of $\rho(\cdot, \Pi(\cdot))$, bounded between $-K$ and $K$),

$$\sum_{t=1}^{n} \bigl( \rho(p, \Pi(\hat h_t)) - \rho(p_t, \Pi(\hat h_t)) \bigr) \le \sum_{t=1}^{n} \tilde r(p_t, \hat h_t) \cdot (p - p_t).$$

Finally, we deal with the mixing with the uniform distribution,

$$\begin{aligned}
\sum_{t=1}^{n} \tilde r(p_t, \hat h_t) \cdot (p - p_t)
&\le (1 - \gamma) \sum_{t=1}^{n} \tilde r(p_t, \hat h_t) \cdot (p - \tilde p_t) + 2K\gamma n \\
&\qquad \text{(since by definition } p_t = (1 - \gamma)\tilde p_t + \gamma u\text{)} \\
&\le (1 - \gamma) \Bigl( \frac{\ln N}{\eta} + \frac{n K^2 \eta}{2} \Bigr) + 2K\gamma n \qquad \text{(by (1))}.
\end{aligned}$$

As before, the proof is concluded by the Hoeffding-Azuma inequality (2) and the union bound. □

Appendix A. Uniform Lipschitzness of $\rho$

Proposition A.1 The function $(p, \Delta) \mapsto \rho(p, \Delta)$ is uniformly Lipschitz in its second argument.

Proof. We consider the general case where the signal distribution depends on both the actions and
outcomes. Accordingly, we can write $\rho(p, \Delta)$ as the solution of the following linear program (we denote
$\Delta = (\Delta_1, \ldots, \Delta_N) \in \mathcal{F} \subset \mathcal{P}(\mathcal{S})^N$, where, as usual, we identify each $\Delta_j$ with a $|\mathcal{S}|$-dimensional vector):

$$\begin{aligned}
\rho(p, \Delta) = \min_{q \in \mathbb{R}^{M}}\ & r(p, \cdot)^{T} q \\
\text{s.t. } & H_k\, q = \Delta_k, \quad k = 1, 2, \ldots, N, \\
& e_M^{T} q = 1, \\
& q \ge 0,
\end{aligned}$$

where $r(p, \cdot) = (r(p, j))_j$ is an $M$-dimensional vector, $e_M$ is an $M$-dimensional vector of ones, and
$H_k = H(k, \cdot)$ is the $|\mathcal{S}| \times M$ matrix, whose entry $(s, j)$ is the probability of observing signal $s$ when action
$k$ is chosen and the outcome is $j$.

The program is feasible for every $\Delta \in \mathcal{F}$ so by the duality theorem,

$$\begin{aligned}
\rho(p, \Delta) = \max_{y \in \mathbb{R}^{N|\mathcal{S}|+1}}\ & \bigl[ \Delta_1^{T}\ \Delta_2^{T}\ \ldots\ \Delta_N^{T}\ 1 \bigr]\, y \\
\text{s.t. } & \bigl[ H_1(\cdot, j)^{T}\ H_2(\cdot, j)^{T}\ \ldots\ H_N(\cdot, j)^{T}\ 1 \bigr]\, y \le r(p, j), \quad j = 1, 2, \ldots, M, \\
& y \ge 0,
\end{aligned} \tag{7}$$

where we recall that $H_k(\cdot, j)$ is the $|\mathcal{S}|$-dimensional vector whose $s$-th entry is the probability of observing
signal $s$ if the action is $k$ and the outcome is $j$.

We first claim that $\Delta \mapsto \rho(p, \Delta)$ is Lipschitz for every fixed $p$. Indeed, for every fixed $p$ the
optimization problem involves $\Delta$ only through the objective function. We thus have that the solution to the
optimization problem is obtained at one of finitely many values of $y$ (the vertices of the feasible cone
defined by the constraints of program (7)). (More precisely, the obtained cone may be unbounded if there
are some unconstrained components of $y$. This happens when there exists an $s$ such that $H_k(s, j) = 0$
for all $j$. But then $\Delta_k(s) = 0$ as well and we do not care about the unbounded component $(k - 1)|\mathcal{S}| + s$
of $y$.) Since $\rho(p, \cdot)$ is a maximum of finitely many linear functions we obtain that it is Lipschitz, with
Lipschitz constant bounded by the maximal $\ell_1$ norm of the vertices of the feasible cone of (7).

We now prove that the Lipschitz constant is uniform with respect to $p$. It suffices to consider the
polytope defined by

$$\Bigl\{ y \ge 0, \quad \bigl[ H_1(\cdot, j)^{T}\ H_2(\cdot, j)^{T}\ \ldots\ H_N(\cdot, j)^{T}\ 1 \bigr]\, y \le 1, \quad j = 1, 2, \ldots, M \Bigr\}.$$

This is a cone, and the vertex $y$ with the maximum $\ell_1$ norm upper bounds the Lipschitz constant of
the $\rho(p, \cdot)$, for all $p$. (As before, any unbounded components of $y$ do not matter to the optimization
problem.) □

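Remark A.1 Observe from the proof that an upper bound on the uniform Lipschitz constant can be
easily computed by solving the following linear program,

$$\begin{aligned}
\max_{y \in \mathbb{R}^{N|\mathcal{S}|+1}}\ & e_{N|\mathcal{S}|+1}^{T}\, y \\
\text{s.t. } & \bigl[ H_1(\cdot, j)^{T}\ H_2(\cdot, j)^{T}\ \ldots\ H_N(\cdot, j)^{T}\ 1 \bigr]\, y \le 1, \quad j = 1, 2, \ldots, M, \\
& y \ge 0.
\end{aligned}$$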

Appendix B. Proof of Lemma 7.1 It is equivalent to prove that for all fixed $p$, the function
$q \mapsto \rho(p, H(\cdot, q))$ is linear on the simplex. Actually, the proof exhibits a simpler expression for $\rho$.

To this end, we first group together the outcomes with same feedback signals and define a mapping

$$T : \mathcal{P}(\{1, \ldots, M\}) \to \mathcal{P}(\{1, \ldots, M\}),$$

where $\mathcal{P}(\{1, \ldots, M\})$ is the set of all probability distributions $q$ on the outcomes. Formally, consider the
binary relation defined by $j \equiv j'$ if and only if $h(\cdot, j) = h(\cdot, j')$. (We use here the notation $h$ to emphasize
that we deal with deterministic feedback signals.) Denote by $F_1, \ldots, F_{M'}$ the partition of the outcomes
$\{1, \ldots, M\}$ obtained so, and pick in every $F_j$ the outcome $y_j$ with minimal reward $r(p, y_j)$ against $p$
(ties can be broken arbitrarily, e.g., by choosing the outcome with lowest index). Then, for every $q$, the
distribution $q' = T(q)$ is defined as $q'_{y_j} = \sum_{y \in F_j} q_y$ for $j = 1, \ldots, M'$, and $q'_k = 0$ if $k \ne y_j$ for all $j$.

$T$ is a linear projection (i.e., $T \circ T = T$). It is easy to see that in the case of deterministic feedback
signals, $H(\cdot, q) = H(\cdot, q')$ if and only if $T(q) = T(q')$. This implies that

$$\rho(p, H(\cdot, q)) = \min_{q' :\, T(q') = T(q)} r(p, q') = r(p, T(q)) \tag{8}$$

where the last equality follows from the fact that, by choices of the $y_j$, $r(p, q') \ge r(p, T(q'))$ for all $q'$,
with equality for $q' = T(q) = T^2(q)$. By linearity of $T$, $q \mapsto r(p, T(q)) = \rho(p, H(\cdot, q))$ is therefore linear
itself, as claimed.
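The mapping $T$ can be illustrated on a toy game. The sketch below is a hypothetical helper (the name `merge_outcomes` and the example numbers are assumptions): it groups outcomes whose feedback columns $h(\cdot, j)$ coincide and moves the mass of each group onto the member with the smallest reward against the fixed $p$.

```python
def merge_outcomes(q, h_columns, rewards_against_p):
    """Apply the map T to the distribution q.

    q                 -- distribution over the M outcomes
    h_columns         -- h_columns[j] = tuple (h(1, j), ..., h(N, j))
    rewards_against_p -- rewards_against_p[j] = r(p, j) for the fixed p
    """
    classes = {}
    for j, column in enumerate(h_columns):
        classes.setdefault(column, []).append(j)
    merged = [0.0] * len(q)
    for members in classes.values():
        # min() returns the first minimizer, so ties go to the lowest index.
        representative = min(members, key=lambda j: rewards_against_p[j])
        merged[representative] = sum(q[j] for j in members)
    return merged


q = [0.2, 0.3, 0.5]
h_columns = [("a", "b"), ("a", "b"), ("c", "b")]  # outcomes 0 and 1 are indistinguishable
rewards = [0.9, 0.4, 0.7]
tq = merge_outcomes(q, h_columns, rewards)
print(tq)                                       # mass of outcomes 0 and 1 moves to outcome 1
print(merge_outcomes(tq, h_columns, rewards))   # applying T again changes nothing
```

In this example $r(p, T(q)) = 0.4 \cdot 0.5 + 0.7 \cdot 0.5 = 0.55$, which is smaller than $r(p, q) = 0.65$, in line with the inequality $r(p, q') \ge r(p, T(q'))$ used in the proof.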

Note that the equivalence of $H(\cdot, q) = H(\cdot, q')$ and $T(q) = T(q')$, together with (8), implies the
following sufficient condition for Hannan-consistency (for necessary and sufficient conditions, see [23, 11]).
It is more general than the distinguishing actions condition of [11].

Remark B.1 Whenever $H$ has no two identical columns in the case of deterministic feedback, i.e.,
$h(\cdot, j) \ne h(\cdot, j')$ for all $j \ne j'$, one has that for all $p$ and $q$,

$$\rho(p, H(\cdot, q)) = r(p, q).$$

The condition is satisfied, for instance, for multi-armed bandit problems, where $h = r$ (provided that we
identify outcomes yielding the same rewards against all decision-maker's actions).